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ABSTRACT 

HAMAP (High-quality Automated and Manual 
Annotation of Proteins— available at http://hamap. 
expasy.org/) is a system for the classification and 
annotation of protein sequences. It consists of a 
collection of manually curated family profiles for 
protein classification, and associated annotation 
rules that specify annotations that apply to family 
members. HAMAP was originally developed to 
support the manual curation of UniProtKB/Swiss- 
Prot records describing microbial proteins. Here 
we describe new developments in HAMAP, 
including the extension of HAMAP to eukaryotic 
proteins, the use of HAMAP in the automated anno- 
tation of UniProtKB/TrEMBL, providing high-quality 
annotation for millions of protein sequences, and 
the future integration of HAMAP into a unified 
system for UniProtKB annotation, UniRule. HAMAP 
is continuously updated by expert curators with new 
family profiles and annotation rules as new protein 
families are characterized. The collection of HAMAP 
family classification profiles and annotation rules 
can be browsed and viewed on the HAMAP 
website, which also provides an interface to scan 
user sequences against HAMAP profiles. 

INTRODUCTION 

Falling costs and continuing technological improvements 
mean that genome sequencing has become a routine tool 



in life science research. The availability of thousands of 
finished genome sequences covering taxonomic ranges 
from individual strains to whole kingdoms has allowed 
biologists to ask new questions about the evolution of 
individual proteins, genomes and even species (1). 
Annotated genomes also provide an essential starting 
point in the construction of genome-scale models of 
cellular processes, particularly of cellular metabolism (2). 
These models may in turn serve as a framework for the 
iterative enhancement of genome annotation, providing 
contextual information that is complementary to the 
primary sequence and that can be used to infer potential 
new functions for uncharacterized genes (3). These and 
other applications are critically dependent on the quality 
of genome annotation, both of the predicted gene models, 
and of the functional assignments that are made to the 
putative gene products. 

Genome sequencing technologies are now within the 
reach of many individual research groups, meaning that 
the pace of data production, and subsequent submission 
to archival resources such as the International Nucleotide 
Sequence Database Collaboration (INSDC; composed of 
GenBank, the European Nucleotide Archive and the DNA 
Data Bank of Japan) (4) is unlikely to slow. Exploiting this 
data requires genome annotation that is as complete 
and accurate as possible, but providing this annotation 
remains a challenge. The development of shared 
standard operating procedures by the major sequencing 
centers (5) will undoubtedly improve the quality of the 
resulting archival annotations. These may be further 
enhanced by the provision of detailed functional annota- 
tion by third-party resources that can be updated on a 
regular basis as new knowledge becomes available. 
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One source of such annotation is the UniProt 
Knowledgebase, UniProtKB, a resource of protein 
sequences and associated functional information (6). 
UniProtKB is composed of two sections: UniProtKB/ 
Swiss-Prot, which includes records that have been 
manually reviewed and curated by a human curator, and 
UniProtKB/TrEMBL, which includes unreviewed 
records. UniProtKB sequences from both sections are 
classified by InterPro (7), which groups signatures for 
the identification of conserved protein domains and 
families from a number of resources, and which also 
provides functional annotation in the form of curated 
terms from the Gene Ontology (GO) (8,9). The InterPro 
classification has been exploited for the construction of 
annotation rules that link InterPro signatures and 
other information to relevant functional annotation 
from UniProtKB/Swiss-Prot (10-12). Other resources 
providing functional annotation include KEGG (13), 
MetaCyc (14) and the SEED (15), which combine 
curated reference data on metabolism with methods to 
'project' this data to new genomes. In the case of 
KEGG and the SEED, functions are inferred based on 
sequence homology, whereas the MetaCyc PathoLogic al- 
gorithm makes 'chained' inferences based on annotations 
in INSDC records (16). Another useful source of annota- 
tion for enzymes is PRIAM, which automatically 
identifies conserved sequence signatures in annotated 
enzymes from UniProtKB, and uses these signatures to 
identify and annotate uncharacterized homologs (17). 

Genome sequencing centers and other users rely on the 
information available in these and other systems to 
annotate new genomes and proteins. To enhance the pro- 
vision of such information in UniProtKB, we previously 
developed the HAMAP system (for High-quality 
Automated and Manual Annotation of Proteins) (18). 
HAMAP was originally designed to annotate protein se- 
quences from prokaryotic species to the quality standards 
required by UniProtKB/Swiss-Prot, exactly as a human 
curator would do, and was used in the construction and 
development of UniProtKB/Swiss-Prot (18). HAMAP is 
based on a collection of manually curated family profiles, 
which are used to determine family membership of protein 
sequences. HAMAP profiles are linked to manually 
curated annotation rules, which specify the annotation 
that can be applied to members of the protein family, 
and which include additional control statements that 
supervise the propagation of this annotation to member 
sequences. In the remainder of this article, we describe the 
current status and new developments in HAMAP, and 
briefly describe how HAMAP will be used to annotate 
UniProtKB in the future. 



HAMAP: A COLLECTION OF MANUALLY CURATED 
FAMILY PROFILES WITH ASSOCIATED 
ANNOTATION RULES 

HAMAP family profiles 

HAMAP family profiles are used to determine family 
membership of protein sequences. HAMAP profiles are 
automatically generated from manually curated seed 



alignments of trusted family member sequences. This set 
of trusted member sequences normally includes all 
characterized family members from UniProtKB/Swiss- 
Prot, plus a representative selection of other sequences 
that provide broad taxonomic coverage of the target 
family. Sequences are selected using iterative and recipro- 
cal BLAST searches (19), and the resulting sets are 
compared with those from other resources of protein 
families and homologs including HOGENOM (20), 
OrthoDB (21), TIGRFAMs (22), Pfam (23) and 
PROSITE (24). All protein sequences that are included 
in the seed alignment are manually checked, and where 
necessary corrected. This may typically involve rectifica- 
tion of erroneous start sites or erroneous gene model pre- 
dictions. These corrections are subsequently integrated 
into UniProtKB/Swiss-Prot, thereby guaranteeing that 
the corrected sequences remain fixed and synchronized 
with the HAMAP family profiles of which they are a 
member. 

Following the automatic generation of a detection 
profile from the seed alignment (25), the profile is 
calibrated using the standard PROSITE procedure (26). 
The profile is scanned against a database of randomized 
protein sequences from UniProtKB, and the parameters 
of an extreme value distribution are estimated from the 
score distribution obtained (26). These parameters are 
subsequently used in the normalization of the raw scores 
using an affine transformation (26). The normalized scores 
are related to the commonly used i?-value, which is the 
expected number of matches with a score equal to or 
greater than a given score that would be expected to 
arise by chance. For example, a match with a normalized 
score of 9.0 would be expected to occur roughly once in a 
database of one billion residues. 

During profile construction and calibration, all matches 
to the profile are extracted from UniProtKB and the 
lowest scoring member sequence of the seed alignment is 
used to define an initial threshold value (or trusted cutoff 
score) for the normalized scores to each profile. Curators 
can manually adjust this cutoff to include lower scoring 
member sequences, or raise it to reduce the possibility of 
false positive matches. Curators may also choose to alter 
the composition of the original seed alignment to enhance 
the specificity of the profile, performing iterative profile 
searches until a satisfactory score distribution is obtained. 

HAMAP annotation rules 

Each HAMAP family profile may be associated with one 
or more HAMAP annotation rules. When multiple rules 
are associated to a single profile, then each rule will 
normally apply to a distinct taxonomic group. HAMAP 
annotation rules define the relevant annotations for 
protein sequences that match the associated HAMAP 
profile, and are manually created using information 
from UniProtKB/Swiss-Prot entries. Annotations are 
provided in the form of free text, controlled vocabularies 
from UniProtKB, such as UniPathway (27), and terms 
from the GO (9). Typical annotations may describe 
protein function, enzymatic activities, subcellular 
location, and pathway membership, as well as specific 
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sequence features such as active sites and ligand-binding 
residues. Annotations may be subject to control 
statements that limit their propagation to only those se- 
quences satisfying one or more conditions, such as a re- 
quirement for the presence of specific conserved functional 
residues (18). 

RECENT DEVELOPMENTS IN HAMAP 

Automatic annotation of UniProtKB /TrEMBL 

HAMAP was originally developed as a tool for the 
annotation of microbial protein sequences to the same 
level of detail and to the same quality standards as 
manually curated UniProtKB/Swiss-Prot records (18). 
HAMAP was used to annotate UniProtKB/TrEMBL 
records, which were then carefully checked and integrated 
into UniProtKB/Swiss-Prot. Since our last publication in 
2009 describing the HAMAP classification and annotation 
system, we have made significant alterations to the way 
that HAMAP is used during the UniProtKB curation and 
production process. HAMAP family profiles have now 
been integrated into InterPro, and HAMAP rule-based 
annotation is now applied in a fully automated fashion 
to UniProtKB/TrEMBL records. Rules and conditions 
are interpreted in precisely the same way as before, and 
conditional annotations are applied only to those proteins 
that satisfy the relevant criteria. The set of HAMAP rules 
is also being combined with annotation rules from 
RuleBase (11,12) and P1R (28) into a single automatic 
annotation system for UniProtKB/TrEMBL, UniRule, 
which will be the subject of a forthcoming publication 
by the UniProt consortium. Although HAMAP rules 
will be part of a larger integrated UniRule system, we 
will continue to maintain the HAMAP protein 
family profiles as a basis for protein classification and 
rule-based annotation within UniRule. Together, these 
developments will help leverage the experimental annota- 
tion and manual curation effort from UniProtKB/ 
Swiss-Prot into UniProtKB/TrEMBL, providing func- 
tional annotation for sequences for which no experimental 
data exists. 



Extension of HAMAP to eukaryotes 

The original scope of the HAMAP system was largely 
determined by the taxonomic distribution of the 
complete genomes that were available at the time of its 
inception. As more genomes from other taxonomic groups 
such as eukaryotes have become available in UniProtKB 
(6), through pipelines importing sequences from resources 
such as Ensembl (29), we have begun to observe an 
ever-increasing number of matches to existing HAMAP 
families in these genomes. We have therefore extended 
the scope of HAMAP families and annotation rules 
to include proteins from eukaryotic species, and annota- 
tions derived from these rules have been available in 
UniProtKB since UniProt release 2012 09 of October 
2012. 



Updates to the website 

HAMAP family profiles and their associated annotation 
rules are made available as independent pages on the 
HAMAP website. As more than one annotation rule can 
be triggered by a single HAMAP family profile, each rule 
is assigned a distinct page, and each of these is linked to 
the 'trigger' profile. A typical HAMAP profile page 
provides, in addition to the profile itself, relevant informa- 
tion such as a family name and description, taxonomic 
range (as a list of matching superkingdoms), associated 
annotation rule(s) and cross-references to InterPro, as 
well as information on the score distribution of 
matching proteins, including those that fall below the 
trusted cutoff (Figure 1). In line with these changes, we 
have also redesigned the web view of the annotation rules 
and added new options for searching and accessing the 
collection of annotation rules. As well as listing all rules 
by taxonomic scope, enzyme class, pathway, feature key 
or keywords, it is now also possible to browse the anno- 
tation rules by GO terms. These GO annotations are 
also available for download on the UniProt-GO 
Annotation database ftp site (see ftp://ftp.ebi.ac.uk/pub/ 
databases/GO/goa/UNIPROT/). 

HAMAP STATISTICS AND AVAILABILITY 

As of release 2012_08 of UniProt, HAMAP contains 1780 
family classification profiles and 1720 annotation rules. 
The family profiles cover 2 317 216 UniProtKB entries, 
which is close to 10% of all sequences in UniProtKB. 
Considering only the 1696 complete prokaryotic prote- 
omes of UniProtKB, the coverage of HAMAP is around 
14% of an 'average' prokaryotic proteome. The precise 
figure may vary considerably depending on our knowledge 
of the organism, the degree to which it has been studied, 
and the size of its genome, being around 25% for the 
model organism Escherichia coli, and reaching 64% for 
the reduced genome of Buchnera aphidicola. Coverage is 
dependent on the number of available rules, and we are 
continuing to add new profiles and rules to further 
improve the coverage of proteins by the HAMAP 
system. While HAMAP annotations are made available 
through UniProtKB, HAMAP family profiles and rules 
can also be used directly for the annotation of protein 
sequences through our web interface at http://hamap. 
expasy.org/hamap_scan.html. Users may submit individ- 
ual protein sequences or complete microbial proteomes to 
be scanned against the entire collection of HAMAP 
profiles and annotated by HAMAP rules. 

CONCLUDING REMARKS 

We describe the extension of the scope of the HAMAP 
system of family classification and annotation to eukary- 
otic proteins and its application in the fully automatic 
annotation of the unreviewed section of the UniProt 
knowledgebase, UniProtKB/TrEMBL. These changes 
were implemented without compromising the quality of 
the annotations produced, which remains equal to that 
of manually curated UniProtKB/Swiss-Prot records. 
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Figure 1. A sample HAMAP profile page. The page provides information such as a family name and description, taxonomic range of the hits, 
associated annotation rule(s), cross-references to InterPro and access to matching proteins in UniProtKB. Additionally, links on the page provide 
access to (a) the actual family classification profile, (b) the seed alignment that was used to generate the profile with highlighted features from the 
annotation rule, (c) an interactive, graphical view of the score distribution of matching proteins, including those that fall below the trusted cutoff, 
and (d) an expandable view of the taxonomic distribution of matching proteins in UniProtKB. 



HAMAP annotation rules include numerous checks (or 
conditions) that must be satisfied for annotation propaga- 
tion to proceed, ensuring high specificity of the annota- 
tions produced. This design feature is intended to reduce 



the likelihood of over-annotation, a relatively common 
error in some automated pipelines (30). In the near 
future, the HAMAP annotation rules will be made 
available as one element of an integrated system of 
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automatic annotation for UniProtKB/TrEMBL, 
UniRule. This will be described in a future publication 
by the UniProt consortium. In the context of UniRule, 
we will continue to maintain the HAMAP protein family 
profiles as a basis for protein classification and the devel- 
opment of new annotation rules as new functions are 
discovered. 
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